word2vec architecture

I am trying to understand the skip-gram model of word2vec and am stuck on some of the details. The high-level idea is clear to me: given a word, predict its context. But when the model is actually trained, what are the input and the output for a single training instance?

To make this concrete, ignore refinements such as negative sampling and take the sentence "it is a beautiful day today". For CBOW, the input would be the average of the one-hot encodings of "it", "is", "a", "day", "today", and the ideal output would be the one-hot encoding of "beautiful". For skip-gram I am confused: given the one-hot encoding of "beautiful" as input, what should the output be? Is it the average of the one-hot encodings of "it", "is", "a", "day", "today" in a single training instance, or is it each of "it", "is", "a", "day", "today" in 5 separate training instances? I tried reading the gensim codebase to see what it does, but could not work it out from the code.
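To spell out the two readings I am asking about, here is a minimal sketch in plain Python. It is only an illustration of the question, not how gensim or the original word2vec code builds its batches. The helper names (`one_hot`, `average`) are made up for this post, and I assume the context window is wide enough to cover the whole sentence, as in the example above.

```python
# Toy sentence from the question; the vocabulary is just its distinct words.
sentence = "it is a beautiful day today".split()
vocab = sorted(set(sentence))
index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return the one-hot encoding of `word` over the toy vocabulary."""
    vec = [0.0] * len(vocab)
    vec[index[word]] = 1.0
    return vec


def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


center = "beautiful"
# Assumption: the window covers every other word in the sentence.
context = [w for w in sentence if w != center]

# CBOW as described above: averaged context in, center word out.
cbow_instance = (average([one_hot(w) for w in context]), one_hot(center))

# Skip-gram, reading A: one instance whose target is the averaged context.
skipgram_single = [(one_hot(center), average([one_hot(w) for w in context]))]

# Skip-gram, reading B: one instance per (center, context word) pair.
skipgram_pairs = [(one_hot(center), one_hot(w)) for w in context]

print(len(skipgram_single), "instance under reading A")
print(len(skipgram_pairs), "instances under reading B")
```

Under reading A the target is a soft vector spread over five words; under reading B every target is a proper one-hot vector and the same input is repeated five times. Which of these matches what is actually trained is what I would like to know.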


