To understand why it's not straightforward to tunnel data from cloud infrastructure like AWS to your local box, you'll need to understand how Kafka works. I'll leave the details to the documentation and explain only what's necessary, with a bit of simplification:
Small Recap on Kafka
A cluster of Kafka brokers ensures high availability and fault tolerance. A topic (a log of data) is replicated across a number of brokers to prevent downtime for that particular topic. For each topic, there's an elected leader broker which coordinates the writes to it. When you connect to any Kafka broker and specify a topic for consumption, the broker redirects you, via an advertised DNS name, to the broker which stores the topic.
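The crucial point is that the client, not the broker, opens the second connection: the bootstrap broker only hands back metadata containing the leader's advertised hostname, which the client must then resolve and reach itself. Here's a minimal sketch of that lookup, with plain Python dicts standing in for the metadata response; all broker IDs and hostnames are made-up examples:

```python
# Simplified, self-contained sketch (no real network I/O) of how a Kafka
# client uses cluster metadata to locate the leader broker for a topic.
# Broker IDs and the internal AWS hostnames are illustrative only.
BROKERS = {
    1: "ip-172-31-16-101.eu-west-2.compute.internal",
    2: "ip-172-31-16-102.eu-west-2.compute.internal",
    3: "ip-172-31-16-103.eu-west-2.compute.internal",
}

# Leader broker ID per (topic, partition), as it would appear in the
# metadata response returned by any bootstrap broker.
LEADERS = {("events", 0): 2}

def leader_endpoint(topic: str, partition: int) -> str:
    """Return the host:port the client must connect to for this partition."""
    broker_id = LEADERS[(topic, partition)]
    return f"{BROKERS[broker_id]}:9092"

# The client must now resolve this *internal* DNS name on its own --
# which is exactly what fails from outside the VPC.
print(leader_endpoint("events", 0))
# -> ip-172-31-16-102.eu-west-2.compute.internal:9092
```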
The advanced tunnelers among you might already see the problems to solve. First, we need to forward all the brokers, to make sure the one storing the topic and its replicas are covered. Second, how can we forward all the different brokers when they are all bound to the same port, which we cannot change (since clients expect to connect to that port)? Third, the Kafka broker will advertise the right broker to connect to using an internal DNS name, e.g. ip-172-31-16-102.eu-west-2.compute.internal, which is not reachable from the outside.
The first problem is easily solvable; the second, however, is a bit harder to tackle. Luckily, I found a project called 'kafka-tunnel' from the consultancy Simple Machines, which is a good starting point. The command-line tool expects you to specify a jump host (through which you get access to your private AWS subnet), your ZooKeeper boxes (if you need those), and your broker IP addresses. It automatically creates local network interfaces with the same IP addresses as in the cloud environment. The only thing left to make it work is to redirect DNS resolution from the internal DNS names of your brokers to the IP addresses that are now bound locally. You can do that by simply adding entries to your /etc/hosts, e.g. 172.31.16.102 ip-172-31-16-102.eu-west-2.compute.internal.
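To make the moving parts concrete, here is a hedged sketch of the per-broker steps a tool like kafka-tunnel automates: one loopback alias, one SSH forward bound to that alias, and one /etc/hosts line. The jump host name, broker addresses, and the macOS-style `ifconfig lo0 alias` syntax are illustrative assumptions, not kafka-tunnel's actual implementation, and the script only prints the commands rather than running them:

```python
# Hypothetical sketch of the per-broker setup that kafka-tunnel automates.
# It only *prints* the commands; run them yourself if they fit your platform
# (the loopback alias syntax shown is macOS-style; Linux differs).
def tunnel_setup(jump_host, brokers, port=9092):
    """brokers maps private IP -> internal DNS name (both made-up here)."""
    cmds = []
    for ip, host in brokers.items():
        # 1. Local interface carrying the same IP as the cloud broker.
        cmds.append(f"sudo ifconfig lo0 alias {ip}")
        # 2. Forward the broker port over the jump host, bound to that alias,
        #    so every broker can keep its identical port.
        cmds.append(f"ssh -N -L {ip}:{port}:{ip}:{port} {jump_host} &")
        # 3. Resolve the advertised internal DNS name to the local alias.
        cmds.append(f"echo '{ip} {host}' | sudo tee -a /etc/hosts")
    return cmds

for cmd in tunnel_setup(
    "jump.example.com",  # assumed jump host
    {"172.31.16.102": "ip-172-31-16-102.eu-west-2.compute.internal"},
):
    print(cmd)
```

Binding each forward to a distinct loopback alias is what sidesteps the shared-port problem: all brokers keep port 9092, just on different local addresses.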
Better Developer Experience
Even though the presented workaround is clearly a hack and does not scale to larger setups, I hope this little piece will save you a headache. It's amazing to see the Kafka ecosystem grow bigger every year. And while enterprise clients are the main focus for the leading companies behind Kafka, such as Confluent, I hope that the developer experience for juniors and startups will also improve over time. Tunneling into a cluster for easier development is a feature I'd definitely like to see officially supported.
If you like to (or are obliged to) mock your input data to test your Kafka Streams topologies, I'd like to refer you to Mocked Streams, a testing framework I developed with the help of many contributors, to delight your day-to-day with sweet Scala sugar code.