Archive for the ‘Tips and Tricks’ Category
Home made Dropbox with inosync
I’ve recently started using Dropbox. I have long looked for an application which I could use to keep some things in sync between the various computers I use at home and at work.
I have also looked for note-taking applications to save tidbits of interesting information for later consultation. Google Docs and Tomboy both come pretty close to accomplishing that, but I have found that Tomboy lacks flexibility and there is no simple way yet to push information to Google Docs.
Dropbox came into the picture because a friend helped me boost my account to 10 GB. It then struck me that a file syncing service such as Dropbox would be as good as any note taking application for saving information across several machines.
Dropbox has many things going for it, but it’s just a very well executed implementation of real-time file synchronization, something that I thought someone could, in theory, reproduce by using inotify and rsync.
Well, someone did just that, and it does exactly what Dropbox is doing. It’s much less pretty, or, might I say, more UNIXy, than using Dropbox. Inosync is a Python program that uses inotify and rsync to keep directories synchronized with remote computers in real time.
Using inosync between 2 computers isn’t quite as simple as running a software installer, but it achieves the same result with minimal configuration.
Let’s say you have 2 Linux computers named ComputerA and ComputerB that contain a directory that you want to keep in mutual synchronization. After you install inosync, create the following inosync.py file on ComputerA.
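# Directory watched for changes on this computer.
wpath = "/home/neumann/sync"
# Common remote path, used to build the rnodes entries below.
rpath = "/home/neumann"
# Remote nodes, in rsync syntax. ComputerA pushes to ComputerB.
rnodes = [
    "ComputerB:" + rpath
]
edelay = 1
And the following inosync.py file on ComputerB.
wpath = "/home/neumann/sync"
rpath = "/home/neumann"
# ComputerB pushes back to ComputerA.
rnodes = [
    "ComputerA:" + rpath
]
edelay = 1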
Launch inosync on both computers using this command.
inosync -vc inosync.py
The -v switch will of course make the daemon more verbose so you can know what is going on.
inosync isn’t meant to be used for bidirectional synchronization. Using it as such has the consequence that synchronizing a file to a computer will cause the other to attempt to synchronize it back to the source. The only real consequence of that is a loss of CPU cycles, because the loop ends after both sides have agreed there is nothing to synchronize.
inosync can thus be used to accomplish the same job as Dropbox, and by using it your only space quota will be whatever is left on the smallest hard drive you put to use in the synchronization set. It’s also a solution if you don’t like putting your files out of your reach. If you have a remote VPS with space to spare, you can use inosync just like a Dropbox server.
I wrote earlier that the disadvantage of this method is that it’s a lot less pretty than what Dropbox does on the desktop. There is no notification and, unless you like to look at log files, no way to know when a synchronization is complete. Also, synchronizing between several computers becomes a serious pain, and flat out impossible beyond a certain threshold.
I must also add that I’m not using this myself. I’m using Dropbox because I don’t plan to set up a remote server just for that purpose. While I don’t fully trust Dropbox with my most precious files, for most of the content I work with, their storage setup is probably a lot more reliable than my home servers with their aging hard drives. I’m sorry if this sounds like an advert for Dropbox, but I have to admit they are doing a very good job with their product.
How do you know your SD card is badly broken?
I’ve recently bought a 16 GB SDHC card from a Hong Kong seller on eBay. You might think that I was looking for trouble by not buying a brand-name card. Yet, the card was half the price of brand-name cards and… it might have worked! The seller refunded the card pretty quickly.
The card was properly detected and could be read by the 2 SD card readers I have at home, so I initially thought I had made a good deal. Troubles started when I tried to read data I had written on it. The filesystem always ended up in pieces. A bad sector check did not report anything bad.
The final verdict on the quality of the card was determined by an interesting experiment I’ve done. Here is what I tried.
sudo dcfldd if=/dev/zero of=/dev/sdc
dcfldd is an extension to the venerable dd command. It adds, amongst other features, a transfer status output. This is a welcome improvement over the silent nothing that is running dd. If there is anything you need to remember from this post, it is the existence of the dcfldd command (in the dcfldd package in Debian and Ubuntu).
Once the card is supposedly filled with zeroes, use the same command to read the content:
sudo dcfldd if=/dev/sdc of=BORKED_SD
Then hexdump (hd) the content of BORKED_SD:
00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| * 6abfea50 00 00 00 3f 00 00 00 00 00 00 00 00 00 00 00 00 |...?............| 6abfea60 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| * 74a13000 00 22 00 00 00 06 00 00 00 c7 00 00 00 09 00 00 |."..............| 74a13010 00 00 00 00 00 02 00 00 00 00 00 00 00 0c 00 00 |................| 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| * ...
Beyond this point, the hexdump data starts to become random.
The hexdump would have only output one line if the card had correctly been filled with zeroes. The data you see on the next lines should not be there. Either the card returns bad data, or bad data was somehow written on it. You can then understand why the filesystem on the card constantly got corrupted.
Don’t ever break MY trunk!
Nico’s last blog post touches a subject that has been on my mind for some time now. I must first say that I don’t write this text strictly in reaction to Nico’s post and that I have not verified with him whether he agrees with the points I’m about to make. During the time I’ve spent at Kryptiva, it was pretty common to see what I will call WiP (work-in-progress) commits pushed to our team Mercurial repositories. The reasons usually given for pushing broken or incomplete changesets to repositories are the ones cited by Nico: people need to back up big changes they are making, or want to complete those changes from another computer.
It would be unacceptable to commit WiP changes to a centralized source control system like Subversion or CVS because the repository can be checked out by other users at any point in time. Those users tend to expect a working repository, even if checking out from a public repository usually means there is a risk that whatever you are checking out will not work. The minimum expectation is that the checked out copy will compile.
In a distributed version control system (DVCS) like git, everybody commits to their own copy of a repository. Changes get pushed across repositories in discrete bundles. Unless the programmer was careless, what ends up in the master repository is usually correct. So, even if programmers have committed broken changes at some point in the repository history, people who clone the repository will usually get a sound copy.
Committing broken code will rarely if ever hurt if all you work on are personal or small-scale, short-term projects. If you are a single programmer tracking changes to a project with git and want to break your trunk every so often, then go on, be my guest. You are the only person who will suffer from your broken history. If you work in a group with several distributed repositories, then you need to read the rest of this post to understand why committing broken trunks is a bad thing.
History
The history of a code repository is the documentation of all the changes that were ever made to a project during its lifetime. As such, it’s the only external documentation that programmers will continuously maintain. This is not obvious when working on projects that have a few tens or maybe hundreds of commits. As long as the whole project fits in your head, it is unlikely that you will need to refer to the project change history. The need arises when the project stretches over long time periods and has thousands of commits. The change history is also very useful when a project changes hands.
WiP commits come into this picture because they usually come with a commit message that is not very explicit: “work in progress”, “to be continued”, “I’m not done”, “Finishing tomorrow”, etc. Such a message is of no use if you need to inspect the project history or a blame/annotate log.
In effect, the WiP changesets are separated from the documentation of the change, which usually arrives with the last commit done on the feature. Tracking back the reason for a change is never unworkable but gets progressively more difficult as the project and the repository age.
Bisection
Bisection is a debugging technique that is mostly exclusive to the use of a DVCS. It is a way to find regressions in the repository history by testing past commits using a binary search pattern. At each step of the bisection procedure, the DVCS updates the working copy, putting it in the state represented by a past changeset. The automated bisection procedure then leaves the programmer to test the result. The programmer should at that point run automated tests or reproduce the problem manually.
This graph represents a set of commits in a repository. The solid lines are connected changesets in the project history. The dashed line represents the changesets touched by the bisection procedure. In this picture, the initial broken changeset is F and the first known good changeset is A. The changesets consulted are, in order, D, B, then C, which is then found to be the changeset that introduced the bug.
This graph illustrates what happens when a few WiP commits are introduced in the tree. WiP commits mean the project can’t be compiled at all points in its history, which means it might be impossible to find a regression using bisection.
This is the most serious problem that can happen if you commit broken code to a repository used by a team. It can seriously hamper debugging in big shared repositories.
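The binary search behind bisection fits in a few lines. What follows is a minimal Python sketch rather than the code of any DVCS: the function name find_first_bad and the is_good callback are made up for the illustration, and it assumes the first commit is good, the last one is bad, and every commit in between can be built and tested. The order in which commits get tested depends on how the midpoint is picked, so it can differ from the order a real tool would use.

```python
def find_first_bad(commits, is_good):
    """Binary-search commits (oldest first) for the first bad one.

    Assumes commits[0] is known good, commits[-1] is known bad, and that
    every commit after the first bad one is bad as well.
    Returns the first bad commit and the list of commits that were tested.
    """
    good, bad = 0, len(commits) - 1
    tested = []
    while bad - good > 1:
        mid = (good + bad) // 2
        tested.append(commits[mid])
        if is_good(commits[mid]):
            good = mid   # the regression was introduced later
        else:
            bad = mid    # the regression is here or earlier
    return commits[bad], tested


# Six changesets, A to F; in this example the bug was introduced in C.
history = ["A", "B", "C", "D", "E", "F"]
first_bad, tested = find_first_bad(history, lambda commit: commit < "C")
print(first_bad, tested)
```

A WiP commit that does not build breaks the last assumption: is_good cannot answer for it, and the search has nothing to go on at that step.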
To be continued…
If you are not impressed by the 2 reasons I explained here, then you need to read my next post. I think the best reason not to commit broken code is that a DVCS offers you all the tools you need to make proper commits. I’ll explain how this is possible with Git and Mercurial in my next post on this subject.
Getting the handle of any Outlook window
When embedding a window inside Microsoft Outlook, it is understandable that at some point you need the handle of an Outlook window object, be it an Inspector or an Explorer. The Outlook Object Model does not expose a method to obtain the handle of a window. What follows is based on some information from Dmitry Streblechenko.
Yes, but not in VBA: you need to QI the Inspector (or Explorer) object for the IOleWindow interface, then call IOleWindow::GetWindow()
If you work with low-level Microsoft Outlook you will eventually find some information, very often forum posts, answered by Dmitry. You will also quickly learn that he is very often right.
I wrote the following code in the very last days I worked on the EchoTracker. It was part of a refactoring I did not have time to finish, so this code is UNTESTED. It is based solely on my interpretation of Dmitry’s indications.
/// <summary>
/// Embed the Outlook panel in *any* Outlook explorer. Thanks Dmitry.
/// http://www.pcreview.co.uk/forums/thread-1837879.php
/// </summary>
public static MSOWindow GetOutlookWindow(Outlook.Explorer olExp)
{
IntPtr olExpUnk = IntPtr.Zero;
IntPtr oleWinPtr = IntPtr.Zero;
IntPtr hWnd = IntPtr.Zero;
Guid oleWinGuid = typeof(IOleWindow).GUID;
IOleWindow oleWin = null;
try
{
olExpUnk = Marshal.GetIUnknownForObject(olExp);
oleWinPtr = IntPtr.Zero;
if (Marshal.QueryInterface(olExpUnk, ref oleWinGuid, out oleWinPtr) != 0)
throw new Exception("QueryInterface failed.");
oleWin = (IOleWindow)Marshal.GetObjectForIUnknown(oleWinPtr);
if (oleWin == null)
throw new Exception("GetObjectForIUnknown failed.");
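// Store the window handle in hWnd, which is what gets returned below.
oleWin.GetWindow(out hWnd);
}
finally
{
if (oleWin != null) Marshal.ReleaseComObject(oleWin);
// Release the references taken by QueryInterface and GetIUnknownForObject.
if (oleWinPtr != IntPtr.Zero) Marshal.Release(oleWinPtr);
if (olExpUnk != IntPtr.Zero) Marshal.Release(olExpUnk);
}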
return new MSOWindow(hWnd);
}
For this code to hopefully work, you need to have the COM interop declaration for the IOleWindow interface. You can find this information on pinvoke.net.
Please, if you stumble upon this code and happen to have a need for it, use it or adapt it to your needs, and leave a comment on this post. I repeat that this is untested. I have no plan to test it, as I don’t have a machine on which I can develop for Microsoft Outlook.
Getting 2 C# Outlook addins to talk together
The Outlook Object Model (OOM) exposes the COMAddIns collection of COMAddIn objects, which Outlook plugins can use to communicate with each other. The communication needs to be done through a COM interface. C# and the .NET framework make it very easy.
You first need to make a ComVisible interface which the caller will use to communicate with the callee. The sample we will work with is a simple class that will call System.Windows.Forms.MessageBox.
using System;
using System.Runtime.InteropServices;
namespace MessageBox
{
[ComVisible(true)]
public interface IMessageBox
{
void MessageBox(String msg);
}
}
Next you need to make the callee addin. This addin can be made with or without VSTO. Add-in Express would work too. The callee of course needs to implement the IMessageBox interface.
[ComVisible(true)]
[ComDefaultInterface(typeof(IMessageBox))]
public partial class ThisAddIn : IMessageBox
{
public void MessageBox(String msg)
{
System.Windows.Forms.MessageBox.Show("MessageBox() call: " + msg);
}
protected override object RequestComAddInAutomationService()
{
return this;
}
/* ... snip ... */
}
This is not the full code of the addin. I have removed the boring part generated by VSTO.
There are 2 important things to notice in the above code. The first is the ComDefaultInterface attribute, which defines which default interface is exposed to COM by the addin. This is important because the ThisAddIn class derives from a non-COM-visible class. The page on the NonCOMVisibleBaseClass Managed Debugging Assistant (MDA) has the information on why this is important.
The next important thing is the implementation of the RequestComAddInAutomationService method, which returns an instance of the COM class to the caller. This is part of the communication protocol between addins. You can make this method return any instance of a COM-visible object. We’ve used the addin class itself to keep things simple.
Finally, the caller code amounts to accessing the COMAddins collection and finding the right object inside it to get the interface.
object msgBoxID = "MessageBox.Addin";
Office.COMAddIns addins = null;
Office.COMAddIn msgBox = null;
IMessageBox imsgBox = null;
try
{
addins = Application.COMAddIns;
msgBox = addins.Item(ref msgBoxID);
imsgBox = (IMessageBox)msgBox.Object;
// Actually use the other addin interface.
imsgBox.MessageBox("Hello, this is a selection change.");
}
finally
{
if (addins != null) Marshal.ReleaseComObject(addins);
if (msgBox != null) Marshal.ReleaseComObject(msgBox);
}
This code is pretty straightforward. The 2 important things to notice are the ref object parameter to the COMAddIns.Item method and the fact that you need to use the ProgID of the callee addin when searching inside the COMAddIns collection. I put emphasis on this because I lost some time trying to find the ProgID of the IMessageBox interface. The ProgID of the addin, when using VSTO, is usually the name of the assembly.
This research was done in the context of the development of the Echotracker although it is still way too early to say what feature will be built on top of that.
Some iptables exploration
As far as I know, it is not that well understood that you can control the Linux firewall (iptables) on a per-user basis, which is something that is sometimes useful on multiuser systems.
Per-user control can be done using the -m owner command line switch of iptables. This matching is of course to be done only on outbound packets put on the OUTPUT chain.
The -m owner match
The owner match lets you match outgoing packets in several interesting ways. You can of course match on the UID and GID of any user on the system using --uid-owner and --gid-owner. Those 2 match types cover most of the ground you might want to cover in controlling user network access.
The 2 other switches allow you to match on a process ID and a session ID (--pid-owner and --sid-owner). I can see this type of match used inside a daemon launch script to control which hosts the process can communicate with. Session ID is a lesser known UNIX concept which can match several processes launched from a parent. It is good to know those conditions exist but I won’t be discussing them here.
Selectively drop outbound connections
This is something that you may have good reasons to do on a multiuser system. The following rule drops outbound connections to port 80 on any host for every user except root, which keeps non-root users from connecting to most of the web.
iptables -A OUTPUT -p tcp --dport 80 -m owner ! --uid-owner 0 -j DROP
You can of course use --gid-owner to make a more sensible control using a group.
Prevent incoming and outgoing connections
Even though you can’t select which user can receive inbound connection requests, you can still prevent a specific user’s processes from accepting connections from the outside. You can do that by dropping packets that take part in the TCP connection handshake.
The TCP connection handshake is a process that begins when a connection is attempted on a service port. The first packet sent in that handshake has a special flag, called SYN, raised. If the service port can answer that connection request, it needs to send a TCP packet with 2 flags, SYN and ACK, raised. In practice, this means that if outgoing packets that have the SYN and ACK flags raised are blocked, incoming connections will never succeed.
iptables -A OUTPUT -p tcp --tcp-flags ALL SYN,ACK -m owner ! --uid-owner 0 -j REJECT
This blocks the completion of connection attempts made on all local ports owned by non-root users. You could as well selectively ACCEPT connections only to the well known services on your multiuser machine, but you would have to remember to modify your firewall script every time you enable another service. You can also selectively allow certain classes of users to receive connections using --gid-owner.
I can’t guarantee the level of security offered by the rules I propose here. There are alternatives to firewalls for controlling network security on multiuser systems. My last hosting provider, HCOOP, is using grsec, which is a set of patches for the Linux kernel.
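The --tcp-flags option takes two arguments: a mask of flags to examine, and the subset of those flags that must be set. A small Python sketch of that comparison, using the flag bits from the TCP header, shows which packets the rule above catches. The function names are hypothetical; this is a model of the matching logic, not iptables code.

```python
# Flag bits as laid out in the TCP header.
TCP_FLAGS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04,
             "PSH": 0x08, "ACK": 0x10, "URG": 0x20}


def parse_flags(spec):
    """Turn a spec such as "SYN,ACK", "ALL" or "NONE" into a bit mask."""
    if spec == "ALL":
        return sum(TCP_FLAGS.values())
    if spec == "NONE":
        return 0
    bits = 0
    for name in spec.split(","):
        bits |= TCP_FLAGS[name]
    return bits


def tcp_flags_match(packet_flags, mask, comp):
    """True when, among the flags listed in mask, exactly those in comp are set."""
    return (packet_flags & parse_flags(mask)) == parse_flags(comp)


syn = parse_flags("SYN")          # first packet of the handshake
syn_ack = parse_flags("SYN,ACK")  # the answer sent by the listening side
ack = parse_flags("ACK")          # ordinary traffic on an open connection

for name, flags in [("SYN", syn), ("SYN,ACK", syn_ack), ("ACK", ack)]:
    print(name, tcp_flags_match(flags, "ALL", "SYN,ACK"))
```

Only the SYN,ACK reply matches; the SYN and plain ACK packets a user sends when opening a connection of their own do not.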
Compressing a year of timekeeping in 2 hours
I’m very bad at keeping track of the time I spend working. This tends to require manual input, and something to remind me to do the input. The latter part is where I usually fail and lose interest. This meant that last week I had to enter a year’s worth of timekeeping data in a few hours in a web application built for that purpose.
This is not a problem as opaque as it might seem to some people. We use timekeeping at work to keep track of how much time is spent on specific projects, not to keep a precise account of who is working or not at a specific time.
The only place where that data is recorded is in our revision control system, Mercurial. It has a detailed log of what was committed inside a repository and, if the commit message was good, an explanation of why. Scanning each repository (all 72 of them) with the default log command output would have been impractical.
Luckily, Mercurial has a lesser known feature which allows users to present log data in a terser way than the default. This is the --template switch, which is pretty well explained in the Mercurial manual.
The command I’m using in the script below is something like this:
hg log --template "{date|shortdate} {author|email} {rev}\n"
Here is an excerpt of the output of this command.
...
2009-09-09 fdgonthier@kryptiva.com 1934
2009-09-21 fdgonthier@kryptiva.com 1935
2009-09-21 fdgonthier@kryptiva.com 1936
...
So this shows some commits I made in a specific project during the month of September 2009. It was then trivial to extract that data from all the repositories to see what I was working on at what date. The following script loops over all my repositories and extracts from the log the dates in 2009 where I committed something. Note that I have added another field to the template, which is the name of the directory containing the Mercurial repository. This will be used to distinguish between projects in the step after the data is obtained.
#!/bin/sh
# Collect my 2009 commits from every Mercurial repository found in the current directory.
mkdir -p ~/churn
for i in $(find . -maxdepth 1 -type d | cut -c 3-); do
    if [ -e "$i/.hg" ]; then
        echo "Churning $i"
        (cd "$i"; \
        hg log \
        --template "{date|shortdate} $i {author|email} {rev}\n" |\
        grep -E "^2009.*(fdgonthier)") > ~/churn/"$i"
    fi
done
From the files in the churn directory it’s then trivial to get a picture of everything that was worked on all through the year. Just cat the files together and sort the whole set of lines by date.
> cd ~/churn && cat * | sort | less
...
2009-03-03 bar-daemon fdgonthier@kryptiva.com 1803
2009-03-03 bar-daemon fdgonthier@kryptiva.com 1804
2009-03-04 libfoo fdgonthier@kryptiva.com 5
2009-03-04 libfoo fdgonthier@kryptiva.com 6
2009-03-04 bar-daemon fdgonthier@kryptiva.com 1805
2009-03-04 bar-daemon fdgonthier@kryptiva.com 1806
...
This will only be as accurate as your repositories are clean. For example, it might be difficult to extract only the changesets you made if you did not pay attention to correctly configuring your default commit name. It happened to me in some contexts. I also had to use the revision numbers from the log to look up the content of some commits because I could not remember to what subproject they were attached.
This is not something you want to have to do. It’s much more accurate and easy to properly feed the timetracking program on a daily basis. There is no excuse not to do it properly, but if you tend to forget that kind of thing, this trick can help.
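Once the churn files are merged and sorted, the listing can be condensed further into one entry per day. Here is a minimal Python sketch, assuming the four-field line format produced by the script above; the function name projects_per_day is made up for the illustration, and the sample lines simply follow the format of the listing.

```python
from collections import defaultdict


def projects_per_day(lines):
    """Map each date to the sorted list of projects committed to on that date."""
    per_day = defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip anything that is not a "date project author rev" line
        date, project, _author, _rev = fields
        per_day[date].add(project)
    return {date: sorted(projects) for date, projects in sorted(per_day.items())}


sample = [
    "2009-03-03 bar-daemon fdgonthier@kryptiva.com 1803",
    "2009-03-03 bar-daemon fdgonthier@kryptiva.com 1804",
    "2009-03-04 libfoo fdgonthier@kryptiva.com 5",
    "2009-03-04 bar-daemon fdgonthier@kryptiva.com 1805",
]

for date, projects in projects_per_day(sample).items():
    print(date, ", ".join(projects))
```

The per-day summary is usually all a timekeeping form asks for; the revision numbers stay in the churn files for the cases where a commit has to be looked up.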
LD_PRELOAD fun
Here is a welcome digression from my previous Twitter oriented posts. I’m starting to play around with the LD_PRELOAD feature in the Linux dynamic linker. For those who might not know what this feature is, here is the description from ld.so (8).
LD_PRELOAD
A whitespace-separated list of additional, user-specified, ELF
shared libraries to be loaded before all others. This can be
used to selectively override functions in other shared
libraries. For setuid/setgid ELF binaries, only libraries in
the standard search directories that are also setgid will be
loaded.
So in practical terms, any libraries you specify in the LD_PRELOAD environment variable will be loaded before any system libraries. This means that dynamic symbols in a loading program will first be searched for in those libraries before being searched for anywhere else, so you can override any symbol defined in the standard libraries.
Let’s start with a rather juvenile example. This will change the behavior of the read (2) function in order to make the user believe a file might have a different content.
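#include <string.h>
#include <sys/types.h>

/* Replaces read(2): the first call returns a canned string, later calls report end of file. */
ssize_t read(int fd, void *buf, size_t count) {
    static int done = 0;
    if (!done) {
        char silly_str[] = "Haha you got overriden.\n";
        /* sizeof() counts the terminating NUL byte, which must not be copied. */
        size_t len = sizeof(silly_str) - 1;
        size_t s = count > len ? len : count;
        memcpy(buf, silly_str, s);
        done = 1;
        return s;
    }
    else return 0;
}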
If you compile this inside a library that is called, for example, libread.so, you can test this code by running:
> /bin/cat /etc/fstab
# /etc/fstab: static file system information.
# ...
> LD_LIBRARY_PATH=. LD_PRELOAD=libread.so /bin/cat /etc/fstab
Haha you got overriden.
That in itself is just a rather silly prank you can play on your friend’s computer if you happen to have access to it. Experienced programmers will start seeing potential uses for LD_PRELOAD. I am getting to that.
The subject of our next example will be the honorable ls (1). ls uses the opendir (3) function to open a directory and browse its files. It should react properly if it can’t open the directory. One way to test this is to make opendir() return NULL and observe how the caller reacts. You can do that using LD_PRELOAD.
DIR *opendir(const char *name) {
return NULL;
}
> LD_LIBRARY_PATH=. LD_PRELOAD=libls1.so /bin/ls /tmp
/bin/ls: cannot open directory /tmp
What can you do now if you want to preserve part of the behavior of the function, or modify the result it returns? Your preloaded library will then need to use libdl to dynamically look up the function whose behavior it wants to modify.
The following example is a very simple override of the opendir (3) function which opens a different directory than what the caller expects. I will explain this function in more detail below.
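#define _GNU_SOURCE /* needed for RTLD_NEXT */
#include <dirent.h>
#include <dlfcn.h>

DIR *opendir(const char *name) {
    DIR *(*libc_opendir)(const char *name);
    /* Look up the next opendir in the library search order. */
    *(void **)(&libc_opendir) = dlsym(RTLD_NEXT, "opendir");
    return libc_opendir("/tmp");
}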
libdl is fortunately very simple to use. The naive approach would be to use dlopen (3) to open the C library, then get the pointer to the function you are calling using dlsym (3). In theory this technique works, but it circumvents the LD_PRELOAD mechanism: preloaded libraries can be chained, and calling directly into the C library skips any other preloaded library that overrides the same function.
In practice, calling dlopen() on libc on an Ubuntu Karmic system made some programs crash and burn for reasons I will not attempt to explain. The next technique should be preferred on Linux systems, especially when dealing with the system C library.
dlsym() has an option that makes the Linux dynamic linker search for the right symbol to override. This is the RTLD_NEXT flag, which exists precisely for writing wrappers around dynamic library functions.
This leaves libdl the task of returning the pointer to the right symbol: with RTLD_NEXT, dlsym() returns the next occurrence of the symbol in the library search order after the current library.
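The lookup order can be pictured with a toy model. The Python sketch below is only an illustration, not how ld.so is implemented: libraries are an ordered list with the preloaded ones first, a plain lookup returns the first definition found, and passing after= mimics RTLD_NEXT by resuming the search past the calling library. The library names libwrap1.so and libwrap2.so are hypothetical.

```python
def resolve(libraries, symbol, after=None):
    """Return the name of the first library that defines symbol.

    libraries is an ordered list of (name, set_of_symbols) pairs, preloaded
    libraries first. When after is given, the search starts just past that
    library, which is the behavior RTLD_NEXT asks for.
    """
    names = [name for name, _ in libraries]
    start = names.index(after) + 1 if after is not None else 0
    for name, symbols in libraries[start:]:
        if symbol in symbols:
            return name
    raise LookupError(symbol)


# Two chained preloaded wrappers in front of the C library.
libraries = [
    ("libwrap1.so", {"opendir"}),
    ("libwrap2.so", {"opendir"}),
    ("libc.so.6", {"opendir", "read"}),
]

print(resolve(libraries, "opendir"))                       # what the program gets
print(resolve(libraries, "opendir", after="libwrap1.so"))  # what libwrap1.so reaches next
print(resolve(libraries, "opendir", after="libwrap2.so"))  # what libwrap2.so reaches next
print(resolve(libraries, "read"))                          # not overridden by anyone
```

In this model, a wrapper that called straight into libc.so.6 would skip libwrap2.so, which is the chaining problem described above.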
The next and final example of the use of LD_PRELOAD will still use the valiant ls. In time for Christmas, this will modify the output of ls by randomizing the d_type field returned in the dirent structure by readdir (3). If you use colorized ls output, and I believe most of you probably do, you should see a pretty display of color whenever you list a directory by preloading this function.
struct dirent64 *readdir64(DIR *dir) {
static struct dirent64 *(* libc_readdir64)(DIR *dir) = NULL;
struct dirent64 *dent;
unsigned char rnd_dtype[7] = { DT_UNKNOWN, DT_REG,
DT_DIR, DT_FIFO,
DT_SOCK, DT_CHR,
DT_BLK };
if (libc_readdir64 == NULL) {
*(void **)(&libc_readdir64) = dlsym(RTLD_NEXT, "readdir64");
srand(time(NULL));
}
dent = libc_readdir64(dir);
if (dent != NULL)
dent->d_type = rnd_dtype[rand() % 7];
return dent;
}
There is still a problem with this code on my new Ubuntu Hardy machine. The code from the preloaded library hangs before the program terminates. I do not understand why this happens, and a search for this bug did not turn up anything. The problem doesn’t happen with Ubuntu Karmic.
There is nothing new about using LD_PRELOAD this way. Several very nice libraries have been built with the intention of modifying the behavior of typical libraries.
- fakeroot: “fakeroot provides a fake root environment by means of LD_PRELOAD and SYSV IPC (or TCP) trickery.”
- fakechroot: fakechroot provides a fake chroot environment to programs.
- libtrash: “[...] the shared library which, when preloaded, implements a trash can under GNU/Linux”
- cowdancer: cowdancer is a userland implementation of a copy-on-write filesystem.
There are 29 projects matching LD_PRELOAD on freshmeat.net. You might have used some of them.
The code I have written for this demonstration is available on BitBucket.
On String.intern()
Where the author realizes the significance of the String.intern() method
I might have hinted at it in my previous post on the subject of strings in Java, yet I did not realize the significance of the String.intern() method. The following code sample demonstrates the behavior of the String.intern() method, similar to what I demonstrated in that post.
public class TestClass2 {
public static void main(String[] args) {
String s1 = "hello";
String s2 = new String("hello");
// This is going to be false.
if (s1 == s2) System.out.println("s1 == s2");
// This is going to be true.
if (s1 == s2.intern()) System.out.println("s1 == s2.intern()");
}
}
It’s a didactic example at best. It’s when you consider that strings also come from input/output that String.intern() becomes a thing of interest.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
public class TestClass2 {
public static void main(String[] args) {
String s1 = "hello";
String s2 = null;
try {
// Enter hello at this point.
s2 = new BufferedReader(new InputStreamReader(System.in)).readLine();
// This is going to be false.
if (s1 == s2) System.out.println("s1 == s2");
// This is going to be true.
if (s1 == s2.intern()) System.out.println("s1 == s2.intern()");
} catch (IOException e) {
e.printStackTrace();
}
}
}
As you can see in that example, the String.intern() method returns a reference to the string “hello” already in the constant pool. The virtual machine maintains a table of string instances that can be shared between all the string references in the program.
An immediate and obvious benefit of this technique, called string interning, is a reduced memory footprint because of object reuse. Wikipedia also notes that the technique is used by programs that need to do fast string comparisons, such as compilers. It allows strings to be compared by simply comparing references instead of possibly scanning the full length of both strings.
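The table of shared instances can be modelled in a few lines. This is a toy Python sketch of the idea, not the JVM’s implementation; the InternPool class is made up for the illustration. Keep in mind that Python’s is operator plays the role of Java’s ==, and Python’s == the role of equals().

```python
class InternPool:
    """Toy intern table: maps string content to one canonical instance."""

    def __init__(self):
        self._table = {}

    def intern(self, s):
        # setdefault stores s the first time its content is seen and
        # returns the stored instance on every later call.
        return self._table.setdefault(s, s)


pool = InternPool()
s1 = pool.intern("hello")
s2 = "".join(["hel", "lo"])  # built at run time, same content as s1
canonical = pool.intern(s2)

print(s1 == s2)         # content comparison
print(canonical is s1)  # reference comparison after interning
```

After interning, a reference comparison is enough to tell whether two strings have the same content, which is the shortcut the fast-comparison use case relies on.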
The JDK documentation gives a better description of the behavior of the String.intern() method. It’s a surprise to me that I never took the time to understand this behavior of such a core class of the Java library.
Microsofties might also find it interesting that the .NET Framework has a String.Intern() method which behaves in approximately the same way.
Bits of info on Java strings
Update: I don’t keep IRC logs and thus cited the wrong guy. Sorry Vince.
On this blog entry I will take on an assertion stated by systemfault. He declared on the #programmeur IRC channel on Freenode that:
<systemfault> String foo = new String("lol"); is the same as String foo = "lol";
<systemfault> When the compiler sees: String foo = "lol";
when it compiles, it will really do String foo = new String("lol");
Which means that:
String s = "hello";
is syntactic sugar for:
String s = new String("hello");
Afterward, on the same subject, this behavior was regarded as a lack of consistency and deemed a fault of the Java language, and the subject was closed. If I were a language lawyer, I would bore people to death by including the proper reference in the Java Language Specification. Since I’m more practically minded and have some experience with Java bytecode, I will dig in and explain a bit of Java internals to you, my reader (or more optimistically, my readers).
An Elegant Proof
Vince is right that there is some syntactic sugar around Java strings, but the example he gave isn’t correct. He failed to consider the fact that, in Java, string literals are in fact first class objects.
The following statement associates the reference s with the "hello" String object:
String s = "hello";
The next snippet makes the reference s point to a new String object which is copied from the "hello" String object. It creates another object having the exact same content as the object "hello". Since Java strings are immutable, it’s really not that useful to duplicate string content this way.
String s = new String("hello");
That behavior is summarized by the following code snippet. If you run this code in your Java virtual machine, you will see that all conditions are satisfied.
String s1 = "hello";
String s2 = new String("hello");
if (s1.equals(s2)) System.out.println("String are equal");
if (s1 == "hello") System.out.println("s1 refers to the \"hello\" object.");
if (s2 != "hello") System.out.println("s2 doesn't refer to the \"hello\" object");
if (s1 != s2) System.out.println("String references are not equal.");
In line 3 we see that both String objects have the same content. Line 4 checks that s1 is indeed a reference pointing to the "hello" object. Line 5 shows that even if the content of the string referred to by s2 is the same as s1, it doesn’t point to the "hello" object. Line 6 further drives that same point home.
The Magic
In the next half of this article, I’ll try to explain a bit why strings behave the way they do in Java.
String literals in Java are stored in what is called the Runtime Constant Pool. This mysterious object is compared in the specification to the concept of a symbol table that is present in many programming languages.
The constant pool includes many kinds of information, including strings, string literals, numeric constants, and references to the methods of other classes.
String literals are loaded from the constant pool at the moment the class is loaded by the class loader of the virtual machine. All direct accesses to those literals will refer to the same instance of the object from the pool.
The Java Virtual Machine Specification goes very far to make sure that all String objects loaded by the virtual machine are not duplicated in memory:
The Java programming language requires that identical string literals (that is, literals that contain the same sequence of characters) must refer to the same instance of class String. In addition, if the method String.intern is called on any string, the result is a reference to the same class instance that would be returned if that string appeared as a literal. Thus,
("a" + "b" + "c").intern() == "abc"
must have the value true.
And indeed, the 2 conditions in the following program will get fired.
While the VM goes far trying to make sure strings are not duplicated in memory, the fact that new String("hello") creates another object should come as no surprise to a programmer. It’s really a case where DWIM prevails.
It’s also the same principle you can see behind the following code. If you can write a condition such as "abc" == "abc", you will instinctively expect that the condition "a" + "b" + "c" == "abc" will be true as well.
if (("a" + "b" + "c") == "abc") System.out.println("yes it is");
if (("a" + "b" + "c").intern() == "abc") System.out.println("yes it is");
It may make sense to somebody used to object-oriented programming to think that "a" + "b" + "c" should return a new string instance. Since Java, I think, sticks to the principle of least surprise, it would be a strange discrepancy if the result of "a" + "b" + "c" were not comparable to "abc" using ==.
All that work is the result of considering string literals as first class objects in the code. Things would be very much different if Java strings were defined as simple byte arrays like in many other languages. There is more high-calorie sugar built around Java strings that I might consider for my next blog post.


